1. INTRODUCTION

Every year in Tay Ninh province, data for the 10th-grade entrance exam of Tay Ninh high schools are
typed in by hand through the database interface of the Tay Ninh Department of Education and Training.
Ten schools in the province organize exams and admissions, and the annual data entry covers 10,835
student files across the province: 4,000 files in the first phase and 6,835 files in the second [1].
All student data are entered manually. From elementary school to middle school, from middle school to
high school, and from high school to the high school graduation exam, these data are re-entered every
year. Once entered and stored, the data of each school level are used only for the school years of
that level. When files are transferred to another school level, the data are re-entered in full, with
nothing inherited. This costs the province labor, time, and money. A system that supports the
digitization of candidates' registration forms is therefore needed to serve the high school entrance
exam in Tay Ninh province [2]-[4].


2. RELATED WORKS 

Jaramillo et al. [5] addressed offline handwritten text recognition (HTR) with reduced training
data sets. Recent HTR solutions based on artificial neural networks show remarkable results on
reference databases. These deep neural networks include convolutional neural networks (CNNs) and
long short-term memory (LSTM) networks. In addition, connectionist temporal classification (CTC) is
the key to avoiding character-level segmentation, which greatly facilitates the labeling task. In
2018, Nguyen et al. [6] created an unconstrained Vietnamese online handwritten text database sampled
from pen-based devices. The database stores handwritten text for paragraphs, lines, words, and
characters, with the ground truth associated with every paragraph and line. They give a detailed
statistical analysis of the handwritten text in this database and describe recognition experiments
using several recent methods, including the bidirectional long short-term memory (Bi-LSTM) network.
Overall, their database contains over 480,000 strokes from more than 380,000 characters and is
currently the largest database of online Vietnamese handwritten text [7].

Nguyen et al. [8] noted that convolutional recurrent neural networks (CRNNs) excel at scene text
recognition. However, this model suffers from vanishing/exploding gradient problems when processing
long text images, which are common in scanned documents. This problem is a significant obstacle to
completely solving the optical character recognition (OCR) problem. Inspired by recently proposed
memory-augmented neural networks (MANNs) for long-term sequential modeling, they introduced a new
architecture called convolutional multi-way associative memory (CMAM) to address the limitations of
current CRNNs. Their architecture, which takes advantage of recent memory access mechanisms in
MANNs, demonstrates superior performance over CRNN counterparts on three real-world long-text OCR
datasets. Additionally, in 2020, Carbune et al. [9] described an online handwriting system that
supports 102 languages using a deep neural network architecture. This new system completely replaced
their previous segment-and-decode-based system, reducing the error rate by 20-40% relative for most
languages. They also report new state-of-the-art IAM-OnDB results for both open and closed dataset
settings. The system combines methods from sequence recognition with a new input encoding using
Bézier curves, which leads to up to 10 times faster recognition than their previous system. Through
a series of experiments, they determine the optimal configuration of their models and report the
results of their setup on several additional public datasets.


3. PROPOSED METHOD 
3.1. Overview 

Vietnamese handwriting recognition is much more complicated than print recognition because
handwriting varies widely with the writer, writing direction, speed, and pen pressure. Although
handwriting studies have made remarkable progress, recognition accuracy is still not high compared
to other recognition fields [10]-[13]. This field therefore holds much potential and is also a
challenge for our research [14]. This article presents the whole method: normalizing the collected
data, detecting the regions of the image that contain handwritten text, training the model, and
recognizing Vietnamese handwriting by OCR with the CRNN model [15].

The general model for extracting and recognizing handwriting to extract information from the
10th-grade enrollment form in Tay Ninh province is shown in Figure 1. This model consists of three
main parts: region extraction (Cropper), character extraction (Text detection), and string
identifier (OCR). These are the three problems that need to be solved.


[Figure 1 diagram; legible labels: The enrollment form image, Text Detection, Recognizer, OCR]


Figure 1. Overview model 


3.2. Region extraction 

We propose an algorithm that extracts the data area from the scanned enrollment form image, keeping
only the critical information area shown in Figure 2 and removing unnecessary information. We then
split the information area into three regions, A, B, and C, shown in Figure 3(a), Figure 3(b), and
Figure 3(c), respectively. The split improves the extraction of information regions when the
efficient and accurate scene text (EAST) deep learning model [16] is applied to each separated area.
Region B of the form contains a score table, and the table lines act as noise that makes region
extraction and character recognition difficult. Separating the form into three zones therefore
improves efficiency. Because the layout of the Tay Ninh enrollment form is fixed, we can separate
these three regions with heuristic thresholds.

Figure 2. The main information container of the 10th-grade enrollment form

Figure 3. Three separate regions; (a) area for recording student background information (Region A),
(b) area for recording learning and training results (Region B), and (c) area of application registration
(Region C)
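
Because the form layout is fixed, the split into regions A, B, and C can be illustrated in a few lines of Python. This is only a sketch under stated assumptions: the function name `split_regions` and the cut fractions in `REGION_CUTS` are hypothetical placeholders, not the thresholds used in this work, and real values would have to be measured on the actual form.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

# Hypothetical cut positions, as fractions of the cropped form height.
# They are placeholders, not the thresholds used in the paper.
REGION_CUTS = {"A": (0.00, 0.35), "B": (0.35, 0.75), "C": (0.75, 1.00)}


def split_regions(width: int, height: int, cuts=REGION_CUTS) -> Dict[str, Box]:
    """Return one full-width crop box per region for a fixed-layout form."""
    boxes = {}
    for name, (top, bottom) in cuts.items():
        if not 0.0 <= top < bottom <= 1.0:
            raise ValueError(f"invalid cut for region {name}: {(top, bottom)}")
        boxes[name] = (0, round(top * height), width, round(bottom * height))
    return boxes
```

Each box follows the (left, upper, right, lower) order, so it could be passed to an image cropping routine such as Pillow's `Image.crop`.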


3.3. Character extraction 

Deep learning methods have the advantage of automatically learning features from the input data of
the problem [17], [18]. We first apply the EAST model to detect text areas and create text boxes for
the image areas that contain handwriting. EAST is a powerful deep learning method for detecting text
in input images. It can find horizontal and rotated bounding boxes and can be combined with any text
recognition method. The EAST detection pipeline eliminates redundant intermediate steps and has only
two stages. First, a fully convolutional network directly produces word-level or line-level text
predictions. Second, these predictions, which can be rotated rectangles or quadrilaterals, go
through a non-maximum suppression step to yield the final output.
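
The suppression step can be illustrated with plain greedy non-maximum suppression. This is a simplified sketch, not the EAST implementation: EAST uses a locality-aware variant and handles rotated boxes, while the code below assumes axis-aligned boxes. The names `iou` and `nms` and the 0.5 overlap threshold are chosen for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Visit boxes from the highest to the lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

With two strongly overlapping candidates and one separate candidate, only the higher-scoring box of the overlapping pair and the separate box are kept.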

The EAST algorithm detects text in the input image by creating a text box for each word or phrase,
which leads to many rectangular boxes for the detected words. Algorithm 1 joins the text boxes in
each row so that image regions can be processed row by row. As a result, an input image with many
rows of information yields an output image with many rows of text boxes. The output text lines are
fed into the OCR system in the next step. Figure 4 illustrates the result of joining text boxes by
row. In the final step of text box detection with the EAST model, we use prior knowledge of the
form, such as its fixed layout and the ratio and position of each field, and apply heuristic
thresholds to split the large boxes into separate fields per row, which helps the data training step
and other methods. The resulting text boxes carry the correct semantics of the 10th-grade enrollment
form in Tay Ninh province, as shown in Figure 5.

Algorithm 1: Algorithm to join text boxes row by row
Input: coordinates of the text boxes, stored in stand_boxes.
Output: large_box = [(x_min, y_min, x_max, y_max)], the coordinates of the boxes joined by row.
1. Put all the boxes into stand_boxes[ ].
2. Calculate the y coordinate of the midpoint of each text box to identify the text boxes that
   belong to the same line, then sort each group of boxes in ascending order (y). Next, put each
   sorted child group_box into the parent group_boxes.
3. Calculate the length of group_boxes.
4. Loop over group_box (from 0 to the length of group_boxes) to calculate the coordinates
   x_min, x_max, y_min, y_max over its text boxes.
5. Calculate large_box = [(x_min, y_min, x_max, y_max)], where (x_min, y_min) and (x_max, y_max)
   are the top-left and bottom-right coordinates of the joined box.
6. Draw the large boxes according to the calculated coordinates.
End the algorithm.
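
Algorithm 1 might be sketched in Python as follows. The sketch assumes that boxes are given as (x_min, y_min, x_max, y_max) tuples and that two boxes belong to the same row when their vertical midpoints differ by at most `y_tolerance` pixels. This tolerance and its default value are assumptions, since the algorithm does not state how rows are delimited. The drawing step is omitted.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def join_boxes_by_row(stand_boxes: List[Box], y_tolerance: int = 10) -> List[Box]:
    """Merge word-level boxes that lie on the same text row into one box per row."""
    # Sort by the vertical midpoint so boxes of the same row become neighbours.
    ordered = sorted(stand_boxes, key=lambda b: (b[1] + b[3]) / 2)

    group_boxes: List[List[Box]] = []
    for box in ordered:
        mid_y = (box[1] + box[3]) / 2
        if group_boxes:
            last = group_boxes[-1][-1]
            last_mid_y = (last[1] + last[3]) / 2
            if abs(mid_y - last_mid_y) <= y_tolerance:
                group_boxes[-1].append(box)  # same row as the previous box
                continue
        group_boxes.append([box])  # start a new row

    # The joined box of a row spans the extreme coordinates of its members.
    large_boxes: List[Box] = []
    for group_box in group_boxes:
        large_boxes.append((
            min(b[0] for b in group_box),
            min(b[1] for b in group_box),
            max(b[2] for b in group_box),
            max(b[3] for b in group_box),
        ))
    return large_boxes
```

The result is ordered from the top row to the bottom row, which matches the order in which the text lines would be passed to the OCR step.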


Figure 4. Example of results of text box connection by row of area A

Figure 5. Example of results of separated fields of area A


3.4. The string identifier 

We propose solving the OCR problem with CRNN and attention models to recognize Vietnamese
handwriting in the 10th-grade enrollment form of Tay Ninh province [19]-[22]. The CRNN is a popular
network model that gives good results in print and handwriting recognition [23]. We trained a CRNN
model for Vietnamese handwriting recognition on the dataset processed from the enrollment forms. The
CRNN model performs both feature extraction and handwriting recognition, as shown in Figure 6, and
it achieved relatively good results on the enrollment form dataset.

The CRNN model for handwriting recognition presented in this paper consists of two parts: a CNN and
an RNN + LSTM block [24], [25]. Specifically, the CNN extracts features from the image, so the
architecture of the CNN block must accept input of size w×h. The output of the CNN block is the
input of the RNN + LSTM block.
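
The hand-off between the two blocks can be illustrated with the map-to-sequence step found in common CRNN designs, where each column of the CNN feature map becomes one time step for the recurrent block. The sketch below uses nested lists and a hypothetical function name `map_to_sequence`. It shows the general idea only, since the layer sizes of the model used here are not given.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def map_to_sequence(feature_map: FeatureMap) -> List[List[float]]:
    """Turn a C x H x W feature map into W feature vectors of length C * H.

    Each vector describes one vertical slice of the image, read left to right,
    which is the order in which a recurrent layer would consume them.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    sequence = []
    for x in range(width):
        column = [feature_map[c][y][x] for c in range(channels) for y in range(height)]
        sequence.append(column)
    return sequence
```

The sequence length equals the width of the feature map, so wider input images produce longer sequences for the recurrent block.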


[Figure 6 diagram; legible label: CTC + ATTENTION]


Figure 6. An identity pattern of CRNN


3.5. Model operation

The input image first passes through the CNN block: the visual geometry group (VGG) component
removes noise, reduces the dimensionality, and extracts features, which are output as feature
vectors (the feature map). Next, the RNN and LSTM block consists of two main components: an RNN
encoder, which encodes the features, and an RNN decoder, which decodes them to produce the output.
Finally, CTC and attention refine the output by removing repeated characters and blank tokens to
produce a complete sentence. This is known as the output alignment problem [26]-[29].
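
The removal of repeated characters and blanks corresponds to the standard CTC collapsing rule: merge consecutive identical labels, then drop the blank token. A minimal sketch follows. It covers only the CTC side, and the blank symbol `-` and the function name `ctc_collapse` are illustrative assumptions.

```python
from itertools import groupby
from typing import List

BLANK = "-"  # hypothetical symbol for the CTC blank token


def ctc_collapse(frame_labels: List[str], blank: str = BLANK) -> str:
    """Collapse per-frame labels: merge repeats first, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)
```

The order of the two operations matters: a blank between two identical labels keeps them as two characters, so a genuinely doubled letter survives the collapse.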


3.6. Train the proposed network model 

After locating the information areas to be extracted with the EAST model, we train the proposed
model on the data. The collected data have certain limitations, described in section 4.1.2.
Therefore, when training the recognition model for the OCR problem, we propose removing four fields
(no data are trained for them): conduct, graduation year, candidates for recruitment, and school to
register for the exam. Their values change very little, so they do not reflect the full variety of
real information. Table 1 lists the fields that are kept and how their data are divided.


Table 1. Ratio of data divided by fields

Numerical   Name fields                              Total number of        Number of images     Number of images
order                                                experimental images    for training (80%)   for testing (20%)
            Total                                    1550                   1240                 310
1           Full name                                307                    246                  61
2           Sex                                      39                     31                   8
3           Date of birth                            93                     74                   19
4           Class                                    87                     70                   17
5           Secondary School                         113                    90                   23
6           District (city)                          90                     72                   18
7           Current accommodation                    113                    90                   23
8           Phone                                    57                     46                   11
9           Academic ability                         191                    153                  38
10          Grade Point Average for the whole year   165                    132                  33
11          Graduation High School Graduation        90                     72                   18
12          Priority Beneficiaries                   113                    90                   23
13          Plus mark                                92                     74                   18


We use the word error rate (WER) to evaluate the recognition errors of the trained model. After
training the model and testing it on the dataset, we found that the recognition rate for handwritten
Vietnamese characters was still low. Therefore, we applied a spelling correction method to improve
it. The idea is to take a recognition result that is incorrect but close to a correct value and
compare it with the complete set of valid values collected for that field, which acts as a
constraint. We ran the spelling correction algorithm on the data fields and found the following.

- Fields whose spelling errors can be corrected: place of birth (name of province/city), district,
  secondary school name, gender, ethnicity, priority category, and academic ability. All possible
  values of these fields can be collected.

- Fields that cannot be corrected: date of birth, phone number, full name, grade, and grade point
  average. These fields have a very large, possibly unbounded, number of values, so a complete list
  is not possible.
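
For the closed-vocabulary fields, the correction idea can be sketched with the Python standard library. The similarity measure of `difflib` is a stand-in chosen for illustration, since the matching measure is not specified here. The function name `correct_field`, the 0.6 cutoff, and the sample district list are hypothetical.

```python
import difflib
from typing import Iterable, Optional


def correct_field(recognized: str, valid_values: Iterable[str],
                  cutoff: float = 0.6) -> Optional[str]:
    """Return the closest valid value, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(recognized, list(valid_values), n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical sample of valid values for the district field.
DISTRICTS = ["Hoa Thanh", "Tan Bien", "Go Dau"]
```

A recognized string such as `Hoa Thamh` would be mapped to `Hoa Thanh`, while a string that resembles no valid value is left uncorrected (`None`), which avoids forcing a wrong value onto the field.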


4. RESULTS AND DISCUSSION 
4.1. Dataset and implementation details 
4.1.1. Data collection and pre-processing 

We use a dataset collected from the 10th-grade enrollment forms of three schools that organize the
entrance exam (Nguyen Chi Thanh, Le Quy Don, and Tran Dai Nghia High School) in Tay Ninh province.
The dataset was collected by scanning the registration forms. Each form consists of three regions:
Region A contains the personal information; Region B contains a table of the candidate's study and
training results over four school years; Region C contains the registration information for grade 10
and the corresponding priority points. The algorithm that separates the image into three regions has
an accuracy of 100%. We used all 1550 images, divided into 1240 images for training and 310 images
for testing and evaluation. The data are divided by field, as shown in Table 1.
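
The counts in Table 1 are consistent with taking 80% of the images of each field and rounding to the nearest whole image, for example 307 full-name images giving 246 for training and 61 for testing. A minimal sketch of that arithmetic follows. The rounding rule is inferred from the table rather than stated in the text, and the function name `split_counts` is illustrative.

```python
from typing import Tuple


def split_counts(total: int, train_percent: int = 80) -> Tuple[int, int]:
    """Split `total` images into (train, test) counts, rounding half up."""
    # Integer arithmetic avoids floating-point surprises when rounding.
    train = (total * train_percent + 50) // 100
    return train, total - train
```

Applied to the overall total of 1550 images, the same rule gives 1240 training images and 310 test images.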

For better training results, we separated the data into individual fields, as shown in Table 2.
These are the input image files for training the network model. The output data is a text file
containing the following information: full name, sex, date of birth, grade, secondary school,
district (city), current address, phone, academic ability, average mark of the whole year, valuation
high school graduation, priority, inherited priority.


Table 2. Set of sample images divided by field for experiment

Numerical order   Name fields                         Separated sample image
1                 Full name                           [handwritten image]
2                 Sex                                 [handwritten image]
3                 Date of birth                       [handwritten image]
4                 Grade                               [handwritten image]
5                 Secondary School                    [handwritten image]
6                 District (city)                     [handwritten image]
7                 Current address                     [handwritten image]
8                 Phone                               [handwritten image]
9                 Academic ability                    [handwritten image]
10                Average mark of the whole year      [handwritten image]
11                Valuation High School Graduation    [handwritten image]
12                Priority                            [handwritten image]
13                Inherited priority                  [handwritten image]


4.1.2. Challenges of the collected data 

Image quality depends heavily on the scanning equipment, the image capture technique, the ambient
light, and the paper of the form, all of which strongly affect the collected images. Quality also
differs from one form image to another. The main difficulty is Vietnamese handwriting: the form has
many fields, and the information in them depends on many factors, such as different writers,
different types of pens, writing direction, light density, character sharpness, writing speed, and
various scribbles. These factors make the dataset difficult for the research problem.


4.2. Experimental environment 

After collecting the dataset and normalizing its noise, we implemented the selected algorithms in
Python 3.7. The computer used to train and test the CRNN handwriting recognition model was
configured as follows: computer type: Lenovo; OS: Windows 10, 64-bit; CPU: Intel Core i9-10900X,
3.7 GHz up to 4.7 GHz, 10 cores, 20 threads; RAM: G.Skill Trident Z RGB 128 GB/3600 (4x32 GB);
storage: Samsung 970 EVO 1 TB NVMe M.2 PCIe SSD; graphics adapter: graphics card that supports image
processing, GPU 64 GB, 128-bit, VGA: 2 x NVIDIA RTX 3090 24 GB GDDR6X.


4.3. Results of information detection based on EAST model 

After applying the EAST model, the extraction results on the three regions reached the accuracy
shown in Table 3. Figure 7 shows the extraction ratio of the regions containing text in the images
of the three regions A, B, and C. For region B, two algorithms are used: (1) conventional image
processing and (2) the deep learning EAST model [30].

Analysis: when text boxes are created with the EAST deep learning model, the detection rate of
regions containing text is relatively high for regions A and C (87% and 83%). However, for region B
(the region with the table of learning results on the student form), the accuracy is 54%, about 30
percentage points lower than in regions A and C. With conventional image processing algorithms on
region B (same input data type), the result is 63%, which is 9 percentage points higher than that of
the EAST model. The EAST model performs poorly on region B because it is a table: the table lines
are noise that significantly affects the EAST results.

Table 3. Result of detection rate for 3 regions A, B, C

Method                                                   Number of     Total fields                   Correct   Wrong   Ratio
                                                         test images
Region A uses deep learning model-EAST                   30            10 fields x 30 images = 300    260       40      87%
Region B uses conventional image processing algorithms   30            14 fields x 30 images = 420    265       155     63%
Region B uses deep learning model-EAST                   30            14 fields x 30 images = 420    226       194     54%
Region C uses deep learning model-EAST                   30            4 fields x 30 images = 120     100       20      83%
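
The ratios in Table 3 follow directly from the counts of correctly detected fields. The short sketch below recomputes them. The function name `detection_rate` and the row labels are illustrative, and the numbers are copied from the table.

```python
def detection_rate(correct: int, fields_per_image: int, images: int) -> float:
    """Percentage of fields whose text region was detected correctly."""
    total_fields = fields_per_image * images
    return 100.0 * correct / total_fields


# (label, correct fields, fields per image, number of test images)
ROWS = [
    ("Region A, EAST", 260, 10, 30),
    ("Region B, conventional image processing", 265, 14, 30),
    ("Region B, EAST", 226, 14, 30),
    ("Region C, EAST", 100, 4, 30),
]

rates = {label: detection_rate(c, f, n) for label, c, f, n in ROWS}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f}%")
```

Rounded to whole percentages, the rates are 87%, 63%, 54%, and 83%, and the gap between the two methods on region B is about 9 percentage points.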


[Figure 7 chart. Title: Results of extracting the region containing text characters. Axes: Extracted
regions, Region detection rate. Legend: Correct ratio, Incorrect ratio.]


Figure 7. Extraction ratio of the region containing text in the image of 3 regions A, B, and C


4.4. Results of Vietnamese handwriting recognition using CRNN 

Table 4 shows the results of using the OCR technique to check recognition on a dataset of 1550
images, including 1240 training images and 310 test images. Table 5 shows examples of correctly
recognized image regions, and Table 6 shows examples of wrongly recognized images. According to the
training and evaluation statistics, the OCR technique achieved a WER of 36.02%. Each image has an
average size of 32x525, with a recognition time of about 0.0471 s per image. The total processing
time for the 310 images with an average size of 32x525 is 13.6214 s.
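
For reference, WER is commonly computed as the word-level edit distance between the recognized text and the label, divided by the number of words in the label. The sketch below is one way to do this. It assumes a corpus-level WER (total errors over total reference words), which may differ from the averaging used for Table 4, and the function names are illustrative.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token lists (substitute, insert, delete)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER in percent: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words
```

As an illustration, reading the misrecognized name in Table 6 as the label `PHAN THI KIM CUONG` and the output `PHAM THI TH`, the function counts two substitutions and one deletion over four reference words, a WER of 75% for that image.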


Table 4. OCR training and identification results using the WER measure
Total number of experimental images    Number of training images    Number of test images    WER (%)
1550                                   1240                         310                      36.02


Table 5. Illustrated correctly recognized image region
Name                  Image      Label         Final         CTC      Attention
A_distric_010.jpg     [image]    Hoa Thanh     Hoa Thanh     HA       Hoa Thanh
A_birthday_004.jpg    [image]    22/11/2005    22/11/2005    /11/k    22/11/2005


Table 6. Illustrated image region misidentified
Name                    Image      Label                 Final          CTC    Attention
a_name_test_084.jpg     [image]    PHAN THI KIM CUONG    PHAM THI TH    Px     PHAM THI TH
a_shool_test_084.jpg    [image]    An Binh               Dan téc rat    Y      Dan téc rat


5. CONCLUSION 

This paper proposes a method that separates text regions with the EAST deep learning model and
applies the CRNN deep learning network model and OCR to recognize Vietnamese handwriting on the
10th-grade enrollment form in Tay Ninh province. We implement a spelling correction algorithm to
increase the accuracy of data recognition. Experimental results show that our model supports data
digitization in the education sector, potentially saving the provincial budget and manpower and
reducing data entry time, especially when there are many student records.